NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Improving Storage Systems Using Machine Learning

https://doi.org/10.1145/3568429

Akgun, Ibrahim Umit; Aydin, Ali Selman; Burford, Andrew; McNeill, Michael; Arkhangelskiy, Michael; Zadok, Erez (February 2023, ACM Transactions on Storage)

Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users—thus burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O-heavy applications, so even a small latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. We propose that ML solutions become a first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. In this article, we describe our proposed ML architecture, called KML. We developed a prototype KML architecture and applied it to two case studies: optimizing readahead and NFS read-size values. Our experiments show that KML consumes less than 4 KB of dynamic kernel memory, has a CPU overhead smaller than 0.2%, and yet can learn patterns and improve I/O throughput by as much as 2.3× and 15× for two case studies—even for complex, never-seen-before, concurrently running mixed workloads on different storage devices.
more » « less
Full Text Available
Predicting Network Buffer Capacity for BBR Fairness

Akgun, Ibrahim "umit"; Vargas, Santiago; Arkhangelskiy, Michael; Burford, Andrew; McNeill, Michael; Balasubramanian, Aruna; Gandhi, Anshul; Zadok, Erez (December 2022, 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Workshop on ML for Systems)

BBR is a newer TCP congestion control algorithm with promising features, but it can often be unfair to existing loss-based congestion-control algorithms. This is because BBR's sending rate is dictated by static parameters that do not adapt well to dynamic and diverse network conditions. In this work, we introduce BBR-ML, an in-kernel ML-based tuning system for BBR, designed to improve fairness when in competition with loss-based congestion control. To build BBR-ML, we discretized the network condition search space and trained a model on 2,500 different network conditions. We then modified BBR to run an in-kernel model to predict network buffer sizes, and then use this prediction for optimal parameter settings. Our preliminary evaluation results show that BBR-ML can improve fairness when in competition with Cubic by up to 30% in some cases.
more » « less
Full Text Available
Ookami: Deployment and Initial Experiences

https://doi.org/10.1145/3437359.3465578

Burford, Andrew; Calder, Alan; Carlson, David; Chapman, Barbara; Coskun, Firat; Curtis, Tony; Feldman, Catherine; Harrison, Robert; Kang, Yan; Michalowicz, Benjamin; et al (July 2021, PEARC '21: Practice and Experience in Advanced Research Computing)
null (Ed.)
Ookami [3] is a computer technology testbed supported by the United States National Science Foundation. It provides researchers with access to the A64FX processor developed by Fujitsu [17] in collaboration with RIKΞN [35, 37] for the Japanese path to exascale computing, as deployed in Fugaku [36], the fastest computer in the world [34]. By focusing on crucial architectural details, the ARM-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. We review relevant technology and system details, and the main body of the paper focuses on initial experiences with the hardware and software ecosystem for micro-benchmarks, mini-apps, and full applications, and starts to answer questions about where such technologies fit into the NSF ecosystem.
more » « less
Full Text Available

Search for: All records